Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dead Page Reference Count Bug Fix #181

Open
wants to merge 32 commits into
base: main
Choose a base branch
from
Open

Conversation

bhaskarbora2
Copy link
Collaborator

@bhaskarbora2 bhaskarbora2 commented Dec 18, 2024

This MR is to release the fix for ref count bug in page recycler's dead page recycling flow. This issue crops up when volume trimmer and page recycler retries a dead page recycling task for a page after recovery. The solution is to track largest_slot_offset in recycler which will be checked at the time of receiving any new dead_page recycling request from volume trimmer. Any request having lower or equal slot_offset compare to largest-slot-offset will be ignored.

The MR also adds in a histogram plot to show where all we are hitting the bug over the entire seed range. This is done through an existing page ref count test.

#13

@bhaskarbora2 bhaskarbora2 self-assigned this Dec 18, 2024
@bhaskarbora2 bhaskarbora2 changed the title Page Reference Count Bug Fix Dead Page Reference Count Bug Fix Dec 18, 2024
src/llfs/page_recycler_events.hpp Outdated Show resolved Hide resolved
src/llfs/page_allocator_state.hpp Outdated Show resolved Hide resolved
src/llfs/page_recycler.hpp Outdated Show resolved Hide resolved
src/llfs/page_recycler.hpp Outdated Show resolved Hide resolved
src/llfs/page_recycler.hpp Outdated Show resolved Hide resolved
src/llfs/page_recycler.hpp Show resolved Hide resolved
src/llfs/page_recycler.hpp Show resolved Hide resolved
Copy link
Collaborator

@tonyastolfi tonyastolfi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment: on functions that return Status or StatusOr, usually it's best to declare as noexcept. My advice would be to do this on functions whose signature we modify, or newly added functions; as opposed to just going through and changing a bunch of existing signatures.

@@ -55,13 +55,21 @@ class PageRecycler
CountMetric<u64> page_drop_error_count{0};
};

struct GlobalMetrics {
CountMetric<u32> page_id_deletion_reissue_count{0};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not u64 here? Not that we expect a lot of these, but why not?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning was this. Say we do recovery every second and we dedup all these requests then we could do that for next ~50,000 years with a 32-bit counter. Why allocate more space when we are never going to hit the limit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth drawing attention to, in order to save 4 bytes.

src/llfs/page_recycler.hpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@tonyastolfi tonyastolfi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I have so far... I need to review the page_recycler.cpp changes in more detail. Will post another review later.

src/llfs/page_recycler_events.hpp Outdated Show resolved Hide resolved
@@ -132,12 +142,14 @@ struct PackedPageToRecycle {

little_page_id_int page_id;
little_u64 batch_slot;
u64 offset_as_unique_identifier;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use little_u64 here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure... this part of the code is all changed now.

src/llfs/page_recycler_events.hpp Outdated Show resolved Hide resolved
@@ -111,13 +113,28 @@ Optional<SlotRange> PageRecyclerRecoveryVisitor::latest_info_refresh_slot() cons
return this->latest_info_refresh_slot_;
}

//==#==========+==+=+=++=+++++++++++-+-+--+----- --- -- - - - -
//
slot_offset_type PageRecyclerRecoveryVisitor::volume_trim_offset() const
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe have the member function name match the member data name? volume_trim_slot()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure...thats a good point.

src/llfs/page_recycler_recovery_visitor.cpp Outdated Show resolved Hide resolved
@@ -45,6 +45,9 @@ class PageRecyclerRecoveryVisitor

Optional<SlotRange> latest_info_refresh_slot() const;

slot_offset_type volume_trim_offset() const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Consider grouping these together as a single struct. The advantages would be:

  • Code that calls this API gets a very strong "hint" that the two always need to be called together (and the pair needs to travel around together)
  • Can define custom operators for that type, to re-use the slot_less_than and lexicographical ordering that is coded inline in a couple places currently.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do the change as discussed...

@@ -29,6 +29,8 @@ namespace llfs {
, latest_batch_{}
, recycler_uuid_{boost::uuids::random_generator{}()}
, latest_info_refresh_slot_{}
, volume_trim_slot_{0}
Copy link
Collaborator

@tonyastolfi tonyastolfi Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider having these be Optional so that we can detect that case where no slot/index information could be recovered from the log. It's unlikely the slot will over overflow (being 64-bits), but making this Optional would also have the benefit of handling integer wrap (like slot_less_than).

Copy link
Collaborator Author

@bhaskarbora2 bhaskarbora2 Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time when the DB is coming up we will never find a slot/index during recovery. At that point we treat them as 0/0. I am not sure if making it optional will be useful here. Let me know if I missed anything?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants